Abstract

We give an efficient algorithm for finding sparse approximate solutions to linear systems of 
equations with nonnegative coefficients. Unlike most known results for sparse recovery, we do not 
require any assumption on the matrix other than non-negativity. Our algorithm is combinatorial 
in nature, inspired by techniques for the set cover problem, as well as the multiplicative weight 
update method. 

We then present a natural application to learning mixture models in the PAC framework. For
learning a mixture of $k$ axis-aligned Gaussians in $d$ dimensions, we give an algorithm that outputs
a mixture of $O(k/\epsilon^3)$ Gaussians that is $\epsilon$-close in statistical distance to the true distribution,
without any separation assumptions. The time and sample complexity is roughly $O(kd/\epsilon^3)^d$.
This is polynomial when $d$ is constant, which is precisely the regime in which known methods fail to
identify the components efficiently.

Given that non-negativity is a natural assumption, we believe that our result may find use
in other settings in which we wish to approximately explain data using a small number of
components drawn from a (large) candidate set.


1 Introduction 


Sparse recovery, or the problem of finding sparse solutions (i.e., solutions with a few non-zero
entries) to linear systems of equations, is a fundamental problem in signal processing, machine
learning and theoretical computer science. In its simplest form, the goal is to find a solution to a
given system of equations $Ax = b$ that minimizes $\|x\|_0$ (which we call the sparsity of $x$).

It is known that sparse recovery is NP-hard in general. It is related to the question of deciding
whether a set of points in $d$-dimensional space is in general position, i.e., whether they do not lie in any
$(d-1)$-dimensional subspace [Khachiyan, 1995]. A strong negative result in the same vein is
due to Arora et al. [1993] and (independently) Amaldi and Kann [1998], who prove that it is not
possible to approximate the quantity $\min\{\|x\|_0 : Ax = b\}$ to a factor better than $2^{(\log n)^{1/2}}$ unless
NP has quasi-polynomial time algorithms.

While these negative results seem forbidding, there are some instances in which sparse recovery
is possible. Sparse recovery is a basic problem in the field of compressed sensing, and in a beautiful
line of work, Candes et al. [2006], Donoho [2006] and others show that convex relaxations can be
used for sparse recovery when the matrix $A$ has certain structural properties, such as incoherence,
or the so-called restricted isometry property (RIP). However, the focus in compressed sensing is
to design matrices $A$ (with as few rows or 'measurements' as possible) that allow the recovery of
sparse vectors $x$ given $Ax$. Our focus is instead on solving the sparse recovery problem for a given
$A, b$, similar to that of [Natarajan, 1995, Donoho and Elad, 2003]. In general, checking if a given
$A$ possesses the RIP is a hard problem [Bandeira et al., 201].


Motivated by the problem of PAC learning mixture models (see below), we consider the sparse 
recovery problem when the matrix A, the vector b, and the solution we seek all have non-negative 
entries. In this case, we prove that approximate sparse recovery is always possible, with some loss 
in the sparsity. We obtain the following trade-off: 

Theorem 1.1. (Informal) Suppose the matrix $A$ and vector $b$ have non-negative entries, and
suppose there exists a $k$-sparse$^1$ non-negative $x^*$ such that $Ax^* = b$. Then for any $\epsilon > 0$, there is an
efficient algorithm that produces an $x_{alg}$ that is $O(k/\epsilon^3)$ sparse, and satisfies $\|Ax_{alg} - b\|_1 \le \epsilon \|b\|_1$.

The key point is that our upper bound on the error is in the $\ell_1$ norm (which is the largest
among all $\ell_p$ norms). Indeed the trade-off between the sparsity of the obtained solution and the
error is much better understood if the error is measured in the $\ell_2$ norm. In this case, the natural
greedy 'coordinate ascent', as well as the algorithm based on sampling from a "dense" solution
give non-trivial guarantees (see [Natarajan, 1995, Shalev-Shwartz et al., 2010]). If the columns of
$A$ are normalized to be of unit length, and we seek a solution $x$ with $\|x\|_1 = 1$, one can find an $x'$
such that $\|Ax' - b\|_2 < \epsilon$ and $x'$ has only 0( log ^^ ) non-zero co-ordinates. A similar bound can
be obtained for general convex optimization problems, under strong convexity assumptions on the
loss function [Shalev-Shwartz et al., 2010].

While these methods are powerful, they do not apply (to the best of our knowledge) when
the error is measured in the $\ell_1$ norm, as in our applications. More importantly, they do not take
advantage of the fact that there exists a $k$-sparse solution (without losing a factor that depends on
the largest eigenvalue of $A^{\dagger}$ as in [Natarajan, 1995], or without additional RIP style assumptions as
in [Shalev-Shwartz et al., 2010]).


The second property of our result is that we do not rely on the uniqueness of the solution (as is
the case with approaches based on convex optimization). Our algorithm is more combinatorial in
nature, and is inspired by multiplicative weight update based algorithms for the set cover problem,
as described in Section 2. Finally, we remark that we do not need to assume that there is an
"exact" sparse solution (i.e., $Ax^* = b$), and a weaker condition suffices. See Theorem 2.1 for the
formal statement.

Are there natural settings for the sparse recovery problem with non-negative $A, b$? One application
we now describe is that of learning mixture models in the PAC framework [Valiant, 1984,
Kearns et al., 1994].


Learning mixture models 

A common way to model data in learning applications is to view it as arising from a "mixture model"
with a small number of parameters. Finding the parameters often leads to a better understanding
of the data. The paradigm has been applied with a lot of success to data in speech, document
classification, and so on [Reynolds and Rose, 1995, Titterington et al., 1985, Lindsay, 1995]. Learning
algorithms for Gaussian mixtures, hidden Markov models, topic models for documents, etc. have
received wide attention both in theory and practice.

1 I.e., has at most $k$ nonzero entries.

In this paper, we consider the problem of learning a mixture of Gaussians. Formally, given
samples from a mixture of $k$ Gaussians in $d$ dimensions, the goal is to recover the components with
high probability. The problem is extremely well studied, starting with the early heuristic methods
such as expectation-maximization (EM). The celebrated result of Dasgupta [1999] gave the first
rigorous algorithm to recover mixture components, albeit under a separation assumption. This was
then improved in several subsequent works (cf. [Arora and Kannan, 2001, Vempala and Wang,
2002, Dasgupta and Schulman, 2007]).

More recently, by a novel use of the classical method of moments, Kalai et al. [2010] and Belkin and Sinha
[2010] showed that any $d$-dimensional Gaussian mixture with a constant number of components $k$
can be recovered in polynomial time (without any strong separation). However, the dependence on
$k$ in these works is exponential. Moitra and Valiant [2010] showed that this is necessary if we wish
to recover the true components, even in one dimension.


In a rather surprising direction, Hsu and Kakade [2013], and later Bhaskara et al. [2014] and
Anderson et al. [2014] showed that if the dimension $d$ is large (at least $k^c$ for a constant $c > 0$),
then tensor methods yield polynomial time algorithms for parameter recovery, under mild
non-degeneracy assumptions. Thus the case of small $d$ and much larger $k$ seems to be the most
challenging for current techniques, if we do not have separation assumptions. Due to the lower
bound mentioned above, we cannot hope to recover the true parameters used to generate the
samples.

Our parameter setting. We consider the case of constant $d$, and arbitrary $k$. As mentioned
earlier, this case has sample complexity exponential in $k$ if we wish to recover the true components
of the mixture [Moitra and Valiant, 2010]. We thus consider the corresponding PAC learning
question [Valiant, 1984]: given $\epsilon, \delta > 0$ and samples from a mixture of Gaussians
as above, can we find a mixture of $k$ Gaussians such that the statistical distance to the original
mixture is $< \epsilon$ with success probability (over samples) $\ge (1 - \delta)$?

Proper vs improper learning. The question stated above is usually referred to as proper learning:
given samples from a distribution $f$ in a certain class (in this case a mixture of $k$ Gaussians),
we are required to output a distribution $\hat{f}$ in the same class, such that $\|f - \hat{f}\|_1 \le \epsilon$. A weaker
notion that is often studied is improper learning, in which $\hat{f}$ is allowed to be arbitrary (in some
contexts, it is referred to as density estimation).

Proper learning is often much harder than improper learning. To wit, the best known algorithms
for proper learning of Gaussian mixtures run in time exponential in $k$. It was first studied
by Feldman et al. [2006], who gave an algorithm with sample complexity polynomial in $k, d$, but
run time exponential in $k$. Later works improved the sample complexity, culminating in the works
by Daskalakis and Kamath [2014] and Acharya et al. [2014], who gave algorithms with optimal sample
complexity, for the case of spherical Gaussians. We note that even here, the run times are
$\mathrm{poly}(d, 1/\epsilon)^k$.

Meanwhile for improper learning, there are efficient algorithms known for learning mixtures
of very general one dimensional distributions (monotone, unimodal, log-concave, and so on). A
sequence of works by Chan et al. [2013, 2014] give algorithms that have near-optimal sample
complexity (of $O(k/\epsilon^2)$), and run in polynomial time. However, it is not known how well these methods
extend to higher dimensions.

In this paper we consider something in between proper and improper learning. We wish to 
return a mixture of Gaussians, but with one relaxation: we allow the algorithm to output a mixture 
with slightly more than k components. Specifically, we obtain a tradeoff between the number of 
components in the output mixture, and the distance to the original mixture. Our theorem here is 
as follows 

Theorem 1.2. (Informal) Suppose we are given samples from a mixture of $k$ axis-aligned Gaussians
in $d$ dimensions. There is an algorithm with running time and sample complexity roughly $O(kd/\epsilon^3)^d$
that outputs a mixture of $O(k/\epsilon^3)$ axis-aligned Gaussians which is $\epsilon$-close in statistical distance to
the original mixture, with high probability.

The algorithm is an application of our result on solving linear systems. Intuitively, we consider
a matrix whose columns are the probability density functions (p.d.f.) of all possible Gaussians in
$\mathbb{R}^d$, and try to write the p.d.f. of the given mixture as a sparse linear combination of these. To
obtain finite bounds, we require careful discretization, which is described in Section 3.

Is the trade-off optimal? It is natural to ask if our tradeoff in Theorem 1.1 is the best possible
(from the point of view of efficient algorithms). We conjecture that the optimal tradeoff is $k/\epsilon^2$,
up to factors of $O(\log(1/\epsilon))$ in general. We can prove a weaker result, that for obtaining an
$\epsilon$ approximation in the $\ell_1$ norm to the general sparse recovery problem using polynomial time
algorithms, we cannot always get a sparsity better than $k \log(1/\epsilon)$ unless $\mathsf{P} = \mathsf{NP}$.

While this says that some dependence on $\epsilon$ is necessary, it is quite far from our algorithmic bound
of $O(k/\epsilon^3)$. In a later section we will connect this to similar disparities that exist in our understanding
of the set cover problem. We present a random planted version of the set cover problem, which
is beyond all known algorithmic techniques, but for which there are no known complexity lower
bounds. We show that unless this planted set cover problem can be solved efficiently, we cannot
hope to obtain an $\epsilon$-approximate solution with sparsity $o(k/\epsilon^2)$. This suggests that doing better
than $k/\epsilon^2$ requires significantly new algorithmic techniques.

1.1 Basic notation 

We will write $\mathbb{R}_+$ for the set of non-negative reals. For a vector $x$, its $i$th co-ordinate will be denoted
by $x_i$, and for a matrix $A$, $A_i$ denotes the $i$th column of $A$. For vectors $x, y$, we write $x \le y$ to mean
entry-wise inequality. We use $[n]$ to denote the set of integers $\{1, 2, \ldots, n\}$. For two distributions
$p$ and $q$, we use $\|p - q\|_1$ to denote the $\ell_1$ distance between them.

2 Approximate sparse solutions 

2.1 Outline 

Our algorithm is inspired by techniques for the well-known set cover problem: given a collection of
$n$ sets $S_1, S_2, \ldots, S_n \subseteq [m]$, find the sub-collection of the smallest size that covers all the elements
of $[m]$. In our problem, if we set $A_i$ to be the indicator vector of the set $S_i$, and $b$ to be the vector
with all entries equal to one, a sparse solution to $Ax = b$ essentially covers all the elements of $[m]$
using only a few sets, which is precisely the set cover problem. The difference between the two
problems is that in linear equations, we are required to 'cover' all the elements precisely once (in
order to have equality), and additionally, we are allowed to use sets fractionally.
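The encoding described above can be illustrated with a minimal sketch. The instance below (four sets over four elements) and the helper name `indicator_matrix` are hypothetical and chosen only for illustration; elements are indexed from 0 rather than 1.

```python
import numpy as np

def indicator_matrix(sets, m):
    """Column i is the 0/1 indicator vector of sets[i] over elements 0..m-1."""
    A = np.zeros((m, len(sets)))
    for i, s in enumerate(sets):
        A[list(s), i] = 1.0
    return A

# Hypothetical instance, for illustration only.
sets = [{0, 1}, {2, 3}, {1, 2}, {0, 1, 2}]
A = indicator_matrix(sets, m=4)
b = np.ones(4)

# Sets 0 and 1 cover every element exactly once, so they solve A x = b.
x_exact = np.array([1.0, 1.0, 0.0, 0.0])
# Sets 1 and 3 are a valid set cover, but element 2 is covered twice,
# so they do not solve the linear system.
x_cover = np.array([0.0, 1.0, 0.0, 1.0])

exact_residual = A @ x_exact - b
cover_values = A @ x_cover
```

The second vector shows the difference noted in the text: a set cover only needs every entry of $Ax$ to be at least one, while the linear system requires equality.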

Motivated by this connection, we define a potential function which captures the notion of
covering all the elements "equally". For a vector $x \in \mathbb{R}^n$, we define

$$\Phi(x) := \sum_j b_j (1+\delta)^{(Ax)_j / b_j}. \qquad (1)$$



This is a mild modification of the potential function used in the multiplicative weight update
method [Freund and Schapire, 1997, Arora et al., 2012]. Suppose for a moment that $\|b\|_1 = 1$.

Now, consider some $x$ with $\|x\|_1 = 1$. If $(Ax)_j = b_j$ for all $j$, the potential $\Phi(x)$ would be precisely
$(1+\delta)$. On the other hand, if we had $(Ax)_j / b_j$ varying significantly for different $j$, the potential
would (intuitively) be significantly larger; this suggests an algorithm that tries to increment $x$
coordinate-wise, while keeping the potential small. Since we change $x$ coordinate-wise, having a
small number of iterations implies sparsity. The key to the analysis is to prove that at any point in
the algorithm, there is a "good" choice of coordinate that we can increment so as to make progress.
We now make this intuition formal, and prove the following.


Theorem 2.1. Let $A$ be an $m \times n$ non-negative matrix, and $b \in \mathbb{R}^m$ be a non-negative vector.
Suppose there exists a $k$-sparse non-negative vector $x^*$ such that $\|Ax^*\|_1 = \|b\|_1$ and $Ax^* \le (1+\epsilon_0) b$,
for some $0 < \epsilon_0 < 1/16$. Then for any $\epsilon \ge 16\epsilon_0$, there is an efficient algorithm that produces an
$x_{alg}$ that is $O(k/\epsilon^3)$ sparse, and satisfies $\|Ax_{alg} - b\|_1 \le \epsilon \|b\|_1$.
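To make the potential of Eq. (1) concrete, here is a small numerical sketch. The matrix, the vector $x^*$ and the value of $\delta$ are hypothetical; the columns of the matrix sum to one and $b$ is strictly positive, which the helper `potential` assumes.

```python
import numpy as np

def potential(A, x, b, delta):
    """Phi(x) = sum_j b_j * (1 + delta) ** ((A x)_j / b_j); assumes every b_j > 0."""
    y = A @ x
    return float(np.sum(b * (1.0 + delta) ** (y / b)))

# Hypothetical instance: each column of A has unit l1 norm.
A = np.array([[0.5, 0.1, 0.3, 0.2],
              [0.2, 0.6, 0.3, 0.1],
              [0.3, 0.3, 0.4, 0.7]])
x_star = np.array([0.5, 0.0, 0.0, 0.5])
b = A @ x_star          # so that A x* = b and ||b||_1 = 1
delta = 0.1

# Exact solution: every ratio (Ax)_j / b_j equals 1, so Phi = 1 + delta.
phi_exact = potential(A, x_star, b, delta)
# A unit-mass vector that covers the rows unevenly has a larger potential.
phi_uneven = potential(A, np.array([0.0, 1.0, 0.0, 0.0]), b, delta)
```

For unit-norm $x$ and normalized columns, convexity of $t \mapsto (1+\delta)^t$ gives $\Phi(x) \ge 1+\delta$, with equality only when all the ratios $(Ax)_j/b_j$ agree, which is the behaviour the sketch exhibits.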


Normalization 

For the rest of the section, $m, n$ will denote the dimensions of $A$, as in the statement of Theorem 2.1.
Next, note that by scaling all the entries of $A, b$ appropriately, we can assume without loss of
generality that $\|b\|_1 = 1$. Furthermore, since for any $i$, multiplying $x_i$ by $c$ while scaling $A_i$ by $(1/c)$
maintains a solution, we may assume that for all $i$, we have $\|A_i\|_1 = 1$ (if $A_i = 0$ to start with, we
can simply drop that column). Once we make this normalization, since $A, b$ are non-negative, any
non-negative solution to $Ax = b$ must also satisfy $\|x\|_1 = 1$.
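The normalization can be sketched in a few lines. The function names `normalize` and `to_normalized` and the small instance are assumptions made for this illustration; the code only assumes that $A$ and $b$ are non-negative with $b \ne 0$.

```python
import numpy as np

def normalize(A, b):
    """Return (A_n, b_n, keep, norms): unit-l1 columns, ||b_n||_1 = 1, zero columns dropped."""
    col_norms = A.sum(axis=0)               # l1 norms, since A is non-negative
    keep = np.flatnonzero(col_norms > 0)    # indices of the non-zero columns
    A_n = A[:, keep] / col_norms[keep]
    b_n = b / b.sum()
    return A_n, b_n, keep, col_norms[keep]

def to_normalized(x, b, keep, norms):
    """Map a solution of A x = b to the corresponding solution of A_n x_n = b_n."""
    return x[keep] * norms / b.sum()

# Hypothetical instance with one all-zero column.
A = np.array([[2.0, 0.0, 1.0],
              [4.0, 0.0, 3.0]])
x = np.array([1.0, 5.0, 2.0])
b = A @ x

A_n, b_n, keep, norms = normalize(A, b)
x_n = to_normalized(x, b, keep, norms)
```

After rescaling, the transformed solution has unit $\ell_1$ norm, as the last sentence of the paragraph above states.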


2.2 Algorithm 

We follow the outline above, having a total of $O(k/\epsilon^3)$ iterations. At iteration $t$, we maintain a
solution $x^{(t)}$, obtained by incrementing precisely one co-ordinate of $x^{(t-1)}$. We start with $x^{(0)} = 0$;
thus the final solution is $O(k/\epsilon^3)$-sparse.

We will denote $y^{(t)} := Ax^{(t)}$. Apart from the potential $\Phi$ introduced above (Eq. (1)), we keep
track of another quantity:

$$\psi(x) := \sum_j (Ax)_j.$$

Note that since the entries of $A, x$ are non-negative, this is simply $\|Ax\|_1$.

Running time. Each iteration of the algorithm can be easily implemented in time $O(mn \log(mn)/\delta)$
by going through all the indices, and for each index, checking for a $\theta$ in multiples of $(1+\delta)$.

procedure solve({$A$, $b$, $k$, $\epsilon$})    // find an $\epsilon$-approximate solution.
begin
  1  Initialize $x^{(0)} = 0$; set parameters $\delta = \epsilon/16$; $C = 16/\epsilon$; $T = Ck/\delta^2$.
     for $t = 0, \ldots, T-1$ do
  2      Find a coordinate $i$ and a scaling $\theta > 0$ such that $\theta \ge 1/(Ck)$, and the ratio
         $\Phi(x^{(t)} + \theta e_i)/\Phi(x^{(t)})$ is minimized.
  3      Set $x^{(t+1)} = x^{(t)} + \theta e_i$.
     end
  4  Output $x_{alg} = x^{(T)} / \|x^{(T)}\|_1$.
end

Note that the algorithm increases $\psi(x^{(t)})$ by at least $1/(Ck)$ in every iteration (because the
increase is precisely $\theta$, which is $\ge 1/(Ck)$), while increasing $\Phi$ as slowly as possible. Our next lemma
says that once $\psi$ is large enough (while having a good bound on $\Phi$), we can get a "good" solution.
I.e., it connects the quantities $\Phi$ and $\psi$ to the $\ell_1$ approximation we want to obtain.
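The procedure can be sketched as follows. This is an illustrative reading rather than a definitive implementation, and it makes several assumptions not fixed by the text: $A$ and $b$ are already normalized with every $b_j > 0$; $\theta$ is searched over the geometric grid $\theta_{\min}(1+\delta)^s \le 1$; and step 2 is interpreted as minimizing the growth of $\log \Phi$ per unit of $\theta$ (the quantity the analysis controls). The potential is kept in log-space to avoid overflow.

```python
import numpy as np

def log_potential(y, b, delta):
    """log Phi for y = A x, computed stably; y may carry leading batch axes."""
    z = np.log(b) + (y / b) * np.log1p(delta)
    zmax = z.max(axis=-1, keepdims=True)
    return (zmax + np.log(np.exp(z - zmax).sum(axis=-1, keepdims=True)))[..., 0]

def solve(A, b, k, eps):
    """Greedy potential-based sketch; assumes unit-l1 columns, ||b||_1 = 1, b > 0."""
    m, n = A.shape
    delta, C = eps / 16.0, 16.0 / eps
    T = int(np.ceil(C * k / delta ** 2))
    theta_min = 1.0 / (C * k)
    # Assumed search grid: theta_min * (1 + delta)^s, capped at 1.
    steps = int(np.floor(np.log(1.0 / theta_min) / np.log1p(delta))) + 1
    thetas = theta_min * (1.0 + delta) ** np.arange(steps)
    x, y = np.zeros(n), np.zeros(m)
    for _ in range(T):
        base = log_potential(y, b, delta)
        # cand[i, s, :] = A (x + thetas[s] * e_i)
        cand = y[None, None, :] + thetas[None, :, None] * A.T[:, None, :]
        growth = (log_potential(cand, b, delta) - base) / thetas[None, :]
        i, s = np.unravel_index(np.argmin(growth), growth.shape)
        x[i] += thetas[s]
        y += thetas[s] * A[:, i]
    return x / x.sum()

# Hypothetical normalized instance with a 2-sparse exact solution.
A = np.array([[0.5, 0.1, 0.3, 0.2],
              [0.2, 0.6, 0.3, 0.1],
              [0.3, 0.3, 0.4, 0.7]])
b = A @ np.array([0.5, 0.0, 0.0, 0.5])
x_alg = solve(A, b, k=2, eps=0.8)
err = float(np.abs(A @ x_alg - b).sum())
```

With $\epsilon = 0.8$ the loop runs $T = 16000$ iterations; smaller $\epsilon$ makes $T = Ck/\delta^2$ grow as $k/\epsilon^3$, so this sketch is meant for small instances only.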


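Lemma 2.2. Let $x \in \mathbb{R}^n$ satisfy the condition $\Phi(x) \le (1+\delta)^{(1+\eta)\psi(x)}$, for some $\eta > 0$. Then we
have

$$\left\| \frac{Ax}{\psi(x)} - b \right\|_1 \le 2\left(\eta + \frac{1}{\delta \psi(x)}\right). \qquad (2)$$

Proof. For convenience, let us write $y = Ax$, and $\tilde{y} = y/\|y\|_1$ (i.e., the normalized version). Note that
since each column of $A$ has unit $\ell_1$ norm, we have $\psi(x) = \|Ax\|_1 = \|y\|_1$. Since $\tilde{y}$ and $b$ are both
normalized, we have

$$\|\tilde{y} - b\|_1 = 2 \sum_{j : \tilde{y}_j > b_j} (\tilde{y}_j - b_j).$$

From now on, we will denote $S := \{j : \tilde{y}_j > b_j\}$, and write $p := \sum_{j \in S} b_j$. Thus to prove the
lemma, it suffices to show that

$$\sum_{j \in S} (\tilde{y}_j - b_j) \le \eta + \frac{1}{\delta \psi(x)}. \qquad (3)$$

Now, note that the LHS above can be written as $\sum_{j \in S} b_j \left(\frac{\tilde{y}_j}{b_j} - 1\right)$. We then have

$$(1+\delta)^{\frac{\psi(x)}{p} \sum_{j \in S} b_j \left(\frac{\tilde{y}_j}{b_j} - 1\right)} \le \sum_{j \in S} \frac{b_j}{p} (1+\delta)^{\psi(x)\left(\frac{\tilde{y}_j}{b_j} - 1\right)} \quad \text{(convexity)}$$
$$\le \frac{1}{p} \sum_j b_j (1+\delta)^{\frac{y_j}{b_j} - \psi(x)} \quad \text{(sum over all } j\text{)}$$
$$\le \frac{1}{p} \cdot \Phi(x) \cdot (1+\delta)^{-\psi(x)}$$
$$\le \frac{1}{p} \cdot (1+\delta)^{\eta \psi(x)} \quad \text{(hypothesis on } \Phi\text{)}.$$

Thus taking logarithms (to base $(1+\delta)$), we can bound the LHS of Eq. (3) by

$$\frac{p}{\psi(x)} \log_{(1+\delta)}(1/p) + p\eta \le \frac{1}{\delta \psi(x)} + \eta.$$

The last inequalities are by using the standard facts that $\ln(1+\delta) \ge \delta/2$ (for $\delta < 1$), and $p \ln(1/p) \le
(1/e)$ for any $p$, and since $p \le 1$. This shows Eq. (3), thus proving the lemma. □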

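The next lemma shows that we can always find an index $i$ and a $\theta$ as we seek in the algorithm.

Lemma 2.3. Suppose there exists a $k$-sparse vector $x^*$ such that $\Phi(x^*) \le (1+\delta)^{(1+\epsilon_0)}$. Then for
any $C > 1$, and any $x^{(t)} \in \mathbb{R}^n$, there exists an index $i$, and a scalar $\theta \ge 1/(Ck)$, such that

$$\Phi(x^{(t)} + \theta e_i) \le (1+\delta)^{\theta(1+\epsilon_0)(1+\delta)/(1-(1/C))} \, \Phi(x^{(t)}).$$

Proof. $x^*$ is $k$-sparse, so we may assume w.l.o.g. that $x^* = \theta_1 e_1 + \theta_2 e_2 + \cdots + \theta_k e_k$. Let us define

$$\Delta_i = \Phi(x^{(t)} + \theta_i e_i) - \Phi(x^{(t)}).$$

First, we will show that

$$\sum_{i=1}^{k} \Delta_i \le \Phi(x^{(t)}) \left[(1+\delta)^{(1+\epsilon_0)} - 1\right]. \qquad (4)$$

To see this, note that the LHS of Eq. (4) equals

$$\sum_j b_j \left( \sum_{i=1}^{k} \left[ (1+\delta)^{(A(x^{(t)} + \theta_i e_i))_j / b_j} - (1+\delta)^{(Ax^{(t)})_j / b_j} \right] \right)$$
$$= \sum_j b_j (1+\delta)^{(Ax^{(t)})_j / b_j} \left( \sum_{i=1}^{k} \left[ (1+\delta)^{(A(\theta_i e_i))_j / b_j} - 1 \right] \right)$$
$$\le \sum_j b_j (1+\delta)^{(Ax^{(t)})_j / b_j} \left( (1+\delta)^{(Ax^*)_j / b_j} - 1 \right).$$

In the last step, we used the fact that the function $f(t) := (1+\delta)^t - 1$ is super-additive, i.e.,
$f(s) + f(t) \le f(s+t)$ for $s, t \ge 0$ (Lemma A.1), and the fact that $x^* = \sum_{i=1}^{k} \theta_i e_i$. Now, using the
bound we have on $\Phi(x^*)$, we obtain Eq. (4). The second observation we make is that since
$\|x^*\|_1 = 1$, we have $\sum_i \theta_i = 1$.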

Now we can apply the averaging lemma A.2 with the numbers $\{\Delta_i, \theta_i\}_{i=1}^{k}$, to conclude that for
any $C > 1$, there exists an $i \in [k]$ such that $\theta_i \ge 1/(Ck)$, and

$$\Delta_i \le \theta_i \cdot \Phi(x^{(t)}) \cdot \frac{(1+\delta)^{(1+\epsilon_0)} - 1}{1 - (1/C)}.$$

Thus we have that for this choice of $i$, and $\theta = \theta_i$,

$$\Phi(x^{(t)} + \theta e_i) \le \Phi(x^{(t)}) \left( 1 + \theta \cdot \frac{(1+\delta)^{(1+\epsilon_0)} - 1}{1 - (1/C)} \right).$$

Now we can simplify the term in the parenthesis using Lemma A.3 (twice) to obtain

$$1 + \theta \cdot \frac{(1+\delta)^{(1+\epsilon_0)} - 1}{1 - (1/C)} \le 1 + \frac{\theta \cdot \delta (1+\epsilon_0)}{1 - (1/C)} \le (1+\delta)^{\theta(1+\epsilon_0)(1+\delta)/(1-(1/C))}.$$

This completes the proof of the lemma. □

Proof of Theorem 2.1. By hypothesis, we know that there exists an $x^*$ such that $\|Ax^*\|_1 = 1$ (or
equivalently $\|x^*\|_1 = 1$) and $Ax^* \le (1+\epsilon_0) b$. Thus for this $x^*$, we have $\Phi(x^*) \le (1+\delta)^{(1+\epsilon_0)}$, so
Lemma 2.3 shows that in each iteration, the algorithm succeeds in finding an index $i$ and $\theta \ge 1/(Ck)$
satisfying the conclusion of the lemma. Thus after $T$ steps, we end up with $\psi(x^{(T)}) \ge 1/\delta^2$, thus
we can appeal to Lemma 2.2. Setting $\eta := 2(\epsilon_0 + \delta + 1/C)$, and observing that

$$(1+\epsilon_0)(1+\delta)/(1 - 1/C) < 1 + \eta,$$

the lemma implies that the $\ell_1$ error is $\le 2(\eta + \delta) < \epsilon$, from our choice of $\eta, \delta$. This completes the
proof. □
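As a sanity check of the parameter choices, the arithmetic behind the iteration count and the final error bound can be spelled out. This is a worked calculation under the settings $\delta = \epsilon/16$ and $C = 16/\epsilon$ from the procedure solve and the hypothesis $\epsilon_0 \le \epsilon/16$, not an additional claim.

```latex
T = \frac{Ck}{\delta^2}
  = \frac{16}{\epsilon} \cdot k \cdot \frac{256}{\epsilon^2}
  = \frac{4096\,k}{\epsilon^3} = O(k/\epsilon^3),
\qquad
\psi\big(x^{(T)}\big) \ge \frac{T}{Ck} = \frac{1}{\delta^2}
\;\Longrightarrow\;
\frac{1}{\delta\,\psi(x^{(T)})} \le \delta .

\eta = 2\left(\epsilon_0 + \delta + \frac{1}{C}\right)
     \le 2\left(\frac{\epsilon}{16} + \frac{\epsilon}{16} + \frac{\epsilon}{16}\right)
     = \frac{3\epsilon}{8},
\qquad
2(\eta + \delta) \le 2\left(\frac{3\epsilon}{8} + \frac{\epsilon}{16}\right)
     = \frac{7\epsilon}{8} < \epsilon .
```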

Remark 2.4. The algorithm above finds a column to add by going through indices $i \in [n]$, and
checking if there is a scaling of $A_i$ that can be added. But in fact, any procedure that allows us to
find a column with a small value of $\Phi(x^{(t+1)})/\Phi(x^{(t)})$ would suffice for the algorithm. For example,
the columns could be parametrized by a continuous variable, and we may have a procedure that
only searches over a discretization.$^2$ We could also have an optimization algorithm that outputs the
column to add.


3 Learning Gaussian mixtures 

3.1 Notation 

Let $N(\mu, \sigma^2)$ denote the density of a $d$-dimensional axis-aligned Gaussian distribution with mean
$\mu$ and diagonal covariance matrix $\sigma^2$. Thus a $k$-component Gaussian mixture has the
density $\sum_{r=1}^{k} w_r N(\mu_r, \sigma_r^2)$. We use $f$ to denote the underlying mixture and $p_r$ to denote component
$r$. We use $\hat{(\cdot)}$ to denote empirical or other estimates; the usage becomes clear in context. For an
interval $I$, let $|I|$ denote its length. For a set $S$, let $n(S)$ be the number of samples in that set.

3.2 Algorithm 

The problem of finding the components of a $k$-component Gaussian mixture $f$ can be viewed as finding
a sparse solution for the system of equations

$$Aw = f, \qquad (5)$$

where columns of $A$ are the possible mixture components, $w$ is the weight vector and $f$ is the
density of the underlying mixture. If $f$ is known exactly, and $A$ is known explicitly, (5) can be
solved using solve({$A$, $f$, $k$, $\epsilon$}).

2 This is a fact we will use in our result on learning mixtures of Gaussians. 







However, a direct application of solve has two main issues. Firstly, $f$ takes values over $\mathbb{R}^d$ and
thus is an infinite dimensional vector. Thus a direct application of solve is not computationally
feasible. Secondly, $f$ is unknown and has to be estimated using samples. Also, for algorithm solve's
performance guarantees to hold, we need an estimate $\hat{f}$ such that $\hat{f}(x) \ge f(x)(1 - \epsilon)$, for all $x$.
This kind of a global multiplicative condition is difficult to satisfy for continuous distributions. To
avoid these issues, we carefully discretize the mixture of Gaussians. More specifically, we partition
$\mathbb{R}^d$ into rectangular regions $\mathcal{S} = \{S_1, S_2, \ldots\}$ such that $S_i \cap S_j = \emptyset$ and $\cup_{S \in \mathcal{S}} S = \mathbb{R}^d$. Furthermore
we flatten the Gaussian within each region to induce a new distribution over $\mathbb{R}^d$ as follows:

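Definition 3.1. For a distribution $p$ and a partition $\mathcal{S}$, the new distribution $p^{\mathcal{S}}$ is defined as$^3$

• If $x, y \in S$ for some $S \in \mathcal{S}$, then $p^{\mathcal{S}}(x) = p^{\mathcal{S}}(y)$.

• $\forall S \in \mathcal{S}$, $p(S) = p^{\mathcal{S}}(S)$.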

Note that we use the standard notation that $p(S)$ denotes the total probability mass of the
distribution $p$ over the region $S$. Now, let $A^{\mathcal{S}}$ be a matrix with rows indexed by $S \in \mathcal{S}$ and columns
indexed by distributions $p$ such that $A^{\mathcal{S}}(S, p) = p(S)$. $A^{\mathcal{S}}$ is a matrix with potentially infinitely
many columns, but finitely many rows (number of regions in our partition).
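A one-dimensional sketch of the matrix $A^{\mathcal{S}}$ may help fix ideas. It is restricted to $d = 1$ and to a finite, hypothetical list of candidate Gaussians given as (mean, standard deviation) pairs; the function name `mass_matrix` and the cut points are likewise assumptions of this illustration.

```python
from statistics import NormalDist

def mass_matrix(edges, gaussians):
    """Entry [S][p] is p(S) for the intervals (-inf, e_0], (e_0, e_1], ..., (e_last, inf)."""
    cols = []
    for mu, sigma in gaussians:
        dist = NormalDist(mu, sigma)
        cdf = [0.0] + [dist.cdf(e) for e in edges] + [1.0]
        cols.append([cdf[t + 1] - cdf[t] for t in range(len(cdf) - 1)])
    # Transpose so that rows are regions and columns are distributions.
    return [list(row) for row in zip(*cols)]

edges = [-1.0, 0.0, 1.0]                 # hypothetical cut points, giving 4 intervals
candidates = [(0.0, 1.0), (0.5, 2.0)]    # hypothetical (mean, std) pairs
A_S = mass_matrix(edges, candidates)
column_sums = [sum(row[c] for row in A_S) for c in range(len(candidates))]
```

Each column is a probability vector over the regions, so the matrix is non-negative with unit $\ell_1$ column norms, which is the normalized setting of Section 2.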

Using samples, we generate a partition of $\mathbb{R}^d$ such that the following properties hold.

1. $f^{\mathcal{S}}(S)$ can be estimated to sufficient multiplicative accuracy for each set $S \in \mathcal{S}$.

2. If we output a mixture of $O(k/\epsilon^3)$ Gaussians $A^{\mathcal{S}} w'$ such that $\sum_{S \in \mathcal{S}} |(A^{\mathcal{S}} w')(S) - f^{\mathcal{S}}(S)|$ is
small, then $\|Aw' - f\|_1$ is also small.

For the first one to hold, we require the sets to have large probabilities, which in turn requires $\mathcal{S}$ to
be a coarse partition of $\mathbb{R}^d$. The second condition requires the partition to be 'fine enough' that
a solution after partitioning can be used to produce a solution for the corresponding continuous
distributions. How do we construct such a partition?

If all the Gaussian components have similar variances and the means are not too far apart, 
then a rectangular grid with carefully chosen width would suffice for this purpose. However, since 
we make no assumptions on the variances, we use a sample-dependent partition (i.e., use some of 
the samples from the mixture in order to get a rough estimate for the ‘location’ of the probability 
mass). To formalize this, we need a few more definitions. 

Definition 3.2. A partition of the real line is given by $\mathcal{I} = \{I_1, I_2, \ldots\}$ where the $I_t$'s are continuous
intervals, $I_t \cap I_{t'} = \emptyset$ for $t \ne t'$, and $\cup_{I \in \mathcal{I}} I = \mathbb{R}$.

Since we have $d$ dimensions, we have $d$ such partitions. We denote by $\mathcal{I}_i$ the partition of axis $i$.
The interval $t$ of coordinate $i$ is denoted by $I_{i,t}$.

For ease of notation, we use subscript $r$ to denote components (of the mixture), $i$ to denote
coordinates ($1 \le i \le d$), and $t$ to denote the interval indices corresponding to coordinates. We now
define induced partition based on intervals and a notion of "good" distributions.

Definition 3.3. Given partitions $\mathcal{I}_1, \mathcal{I}_2, \mathcal{I}_3, \ldots, \mathcal{I}_d$ for coordinates 1 to $d$, define the
$(\mathcal{I}_1, \mathcal{I}_2, \mathcal{I}_3, \ldots, \mathcal{I}_d)$-induced partition $\mathcal{S} = \{S_v\}$ as follows: for every $d$-tuple $v$, $x \in S_v$ iff $x_i \in I_{i,v_i}$ $\forall i$.
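Locating the cell $S_v$ that contains a point amounts to one interval lookup per coordinate. The sketch below assumes each axis partition is represented by a sorted array of cut points, with interval $t$ taken to be $(e_{t-1}, e_t]$; this representation and the name `cell_index` are illustrative choices, not part of the definition.

```python
import numpy as np

def cell_index(x, edges_per_axis):
    """Return the d-tuple v such that x_i lies in interval v_i of axis i."""
    return tuple(int(np.searchsorted(edges, xi, side="left"))
                 for xi, edges in zip(x, edges_per_axis))

# Hypothetical partitions: axis 0 has cut points -1, 0, 1; axis 1 has a single cut at 0.
edges_per_axis = [np.array([-1.0, 0.0, 1.0]), np.array([0.0])]
v_a = cell_index((0.3, -2.0), edges_per_axis)
v_b = cell_index((0.0, 0.0), edges_per_axis)
```

Counting samples per cell, as needed for $n(S)$, then reduces to tallying these tuples.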

3 We are slightly abusing notation, with $p^{\mathcal{S}}$ denoting both the p.d.f. and the distribution itself.

Definition 3.4. A product distribution $p = p_1 \times p_2 \times \ldots \times p_d$ is $(\mathcal{I}_1, \mathcal{I}_2, \mathcal{I}_3, \ldots, \mathcal{I}_d), \epsilon$-good if for
every coordinate $i$ and every interval $I_{i,t}$, $p_i(I_{i,t}) \le \epsilon$.


Intuitively, $\epsilon$-good distributions have small mass in every interval and hence binning them would
not change the distribution by much. Specifically in Lemma 3.8 we show that for such distributions
$\|p - p^{\mathcal{S}}\|_1$ is bounded.

We now have all the tools to describe the algorithm. Let $\epsilon_1 = \epsilon^3/(kd)$. The algorithm first
divides $\mathbb{R}^d$ into a rectangular gridded fine partition $\mathcal{S}$ with $\sim \epsilon_1^{-d}$ bins such that most of them have
probability $\ge \epsilon_1^d \epsilon$. We then group the bins with probability $< \epsilon_1^d \epsilon$ to create a slightly coarser
partition $\mathcal{S}'$. The resulting $\mathcal{S}'$ is coarse enough that $f^{\mathcal{S}'}$ can be estimated efficiently, and is also fine
enough to ensure that we do not lose much of the Gaussian structure by binning.

We then limit the columns of $A^{\mathcal{S}'}$ to contain only Gaussians that are $(\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_d), 2\epsilon^2/d$-good.
In Lemma 3.8 we show that for all of these we do not lose much of the Gaussian structure
by binning. Thus solve($A^{\mathcal{S}'}$, $b$, $k$, $\epsilon$) yields us the required solution. With these definitions in
mind, the algorithm is given in Learn({$(x_1, \ldots, x_{2n})$, $k$, $\epsilon$}). Note that the number of rows in $A^{\mathcal{S}'}$ is
$|\mathcal{S}'| \le |\mathcal{S}| = \epsilon_1^{-d}$.

We need to bound the time complexity of finding a Gaussian in each iteration of the algorithm
(to apply Remark 2.4). To this end we need to find a finite set of candidate Gaussians (columns
of $A^{\mathcal{S}'}$) such that running solve using a matrix restricted to these columns (call it $A^{\mathcal{S}'}_{finite}$) finds
the desired mixture up to error $\epsilon$. Note that for this, we need to ensure that there is at least one
candidate (column of $A^{\mathcal{S}'}_{finite}$) that is close to each of the true mixture components.

We ensure this as follows. Obtain a set of $n'$ samples from the Gaussian mixture and for each
pair of samples $x, y$ consider the Gaussian whose mean is $x$ and whose variance along coordinate $i$ is
$(x_i - y_i)^2$. Similar to the proof of the one-dimensional version in [Acharya et al., 2014], it follows
that for any $\epsilon'$, choosing $n' \ge \Omega((\epsilon')^{-d})$, this set contains Gaussians that are $\epsilon'$ close to each of the
underlying mixture components. For clarity of exposition, we ignore this additional error which
can be made arbitrarily small and we treat $\epsilon'$ as 0.
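The candidate construction from sample pairs can be sketched as below. Skipping pairs that agree in some coordinate (which would give a zero variance) is an assumption added here to keep every candidate a proper Gaussian; the function name and the three sample points are hypothetical.

```python
from itertools import permutations

def candidate_gaussians(samples):
    """For each ordered pair (x, y): mean x, variance (x_i - y_i)^2 along coordinate i."""
    cands = []
    for x, y in permutations(samples, 2):
        var = tuple((xi - yi) ** 2 for xi, yi in zip(x, y))
        if all(v > 0 for v in var):      # assumption: drop degenerate candidates
            cands.append((tuple(x), var))
    return cands

# Hypothetical samples in d = 2; the first and third agree in the second coordinate.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 1.0)]
cands = candidate_gaussians(samples)
```

With $n'$ samples this produces at most $n'(n'-1)$ candidates, which bounds the number of columns of $A^{\mathcal{S}'}_{finite}$.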


3.3 Proof of correctness 

We first show that $b$ satisfies the necessary conditions for solve that are given in Theorem 2.1. The
proof follows from the Chernoff bound and the fact that the empirical mass in most sets $S \in \mathcal{S}'$ is $\ge \epsilon_1^d \epsilon$.

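Lemma 3.5. If $n \ge \frac{8}{\epsilon_1^d \epsilon^3} \log \frac{2}{\delta \epsilon_1^d}$, then with probability $\ge 1 - \delta$,

$$\forall S \in \mathcal{S}', \quad b(S) \ge f^{\mathcal{S}'}(S)(1 - 3\epsilon),$$

and $\sum_{S \in \mathcal{S}'} |f^{\mathcal{S}'}(S) - b(S)| \le 6\epsilon$.

Proof. Let $\hat{f}^{\mathcal{S}'}$ be the empirical distribution over $\mathcal{S}'$. Since $|\mathcal{I}_i| = 1/\epsilon_1$, the induced partition $\mathcal{S}$
satisfies $|\mathcal{S}| \le 1/\epsilon_1^d$. Hence by the Chernoff and union bounds, for $n \ge \frac{8}{\epsilon_1^d \epsilon^3} \log \frac{2}{\delta \epsilon_1^d}$, with probability
$\ge 1 - \delta$,

$$|\hat{f}^{\mathcal{S}'}(S) - f^{\mathcal{S}'}(S)| \le \sqrt{f^{\mathcal{S}'}(S)\, \epsilon_1^d \epsilon^3 / 2} + \epsilon_1^d \epsilon^3 / 2, \quad \forall S \in \mathcal{S}. \qquad (6)$$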
procedure Learn({$(x_1, \ldots, x_{2n})$, $k$, $\epsilon$})
begin
  1  Set parameter $\epsilon_1 = \epsilon^3/(kd)$.
  2  Use first $n$ samples to find $\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_d$ such that the number of samples $x$ with $x_i \in I_{i,t}$
     is $n\epsilon_1$. Let $\mathcal{S}$ be the corresponding induced partition.
  3  Use the remaining $n$ samples to do:
  4      Let $U = \cup S_v : n(S_v) < n \epsilon_1^d \epsilon$.
  5      Let $\mathcal{S}' = \{U\} \cup \{S \in \mathcal{S} : n(S) \ge n \epsilon_1^d \epsilon\}$.
  6      Set $b(U) = 2\epsilon$ and
         $\forall S \in \mathcal{S}' \setminus \{U\}$, $b(S) = \frac{(1 - 2\epsilon)\, n(S)}{\sum_{S \in \mathcal{S}' \setminus \{U\}} n(S)}$.
  7  Let $A^{\mathcal{S}'}$ be the matrix with columns corresponding to distributions $p$ that are
     $(\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_d), 2\epsilon^2/d$-good axis-aligned Gaussians, and $A^{\mathcal{S}'}_{finite}$ be the candidates obtained
     as above, using $\epsilon' = \epsilon_1/10$.
  8  solve($A^{\mathcal{S}'}_{finite}$, $b$, $k$, $64\epsilon$) using Remark 2.4.
  9  Output the $w$.
end

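For the set $U$,

$$f^{\mathcal{S}'}(U) = \sum_{S : \hat{f}^{\mathcal{S}'}(S) \le \epsilon_1^d \epsilon} f^{\mathcal{S}'}(S)
\le \epsilon + \sum_{S \in \mathcal{S}} \sqrt{f^{\mathcal{S}'}(S)\, \epsilon_1^d \epsilon^3 / 2} + \sum_{S \in \mathcal{S}} \epsilon_1^d \epsilon^3 / 2
\le 2\epsilon,$$

where the second inequality follows from the concavity of $\sqrt{x}$. However $b(U) = 2\epsilon$ and hence
$b(U) \ge f^{\mathcal{S}'}(U)(1 - 2\epsilon)$.

By Equation (6), for the remaining sets,

$$|\hat{f}^{\mathcal{S}'}(S) - f^{\mathcal{S}'}(S)| \le \sqrt{f^{\mathcal{S}'}(S)\, \epsilon_1^d \epsilon^3 / 2} + \epsilon_1^d \epsilon^3 / 2
\le f^{\mathcal{S}'}(S)\left(\sqrt{\epsilon^2/2} + \epsilon^2/2\right)
\le f^{\mathcal{S}'}(S)\, \epsilon.$$

The penultimate inequality follows from the fact that $f^{\mathcal{S}'}(S) \ge \epsilon_1^d \epsilon$. Hence $\hat{f}^{\mathcal{S}'}(S) \ge f^{\mathcal{S}'}(S)(1 - \epsilon)$.
Furthermore by construction $b(S) \ge \hat{f}^{\mathcal{S}'}(S)(1 - 2\epsilon)$. Hence $b(S) \ge f^{\mathcal{S}'}(S)(1 - 3\epsilon)$ $\forall S \in \mathcal{S}'$.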
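For the second part of the lemma observe that $b$ and $f^{\mathcal{S}'}$ are distributions over $\mathcal{S}'$. Hence

$$\sum_{S \in \mathcal{S}'} |b(S) - f^{\mathcal{S}'}(S)| = 2 \sum_{S : b(S) \le f^{\mathcal{S}'}(S)} \left(f^{\mathcal{S}'}(S) - b(S)\right)
\le 2 \sum_{S \in \mathcal{S}'} f^{\mathcal{S}'}(S) \cdot 3\epsilon = 6\epsilon.$$

□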


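Using the above lemma, we now prove that Learn returns a good solution such that
$\|f^{\mathcal{S}} - \hat{f}^{\mathcal{S}}\|_1 \le O(\epsilon)$.

Lemma 3.6. Let $n \ge \max\left( \frac{2}{\epsilon_1^2} \log \frac{2d}{\delta},\; \frac{8}{\epsilon_1^d \epsilon^3} \log \frac{2}{\delta \epsilon_1^d} \right)$. With probability $\ge 1 - 2\delta$, Learn returns a
solution $\hat{f}$ such that the resulting mixture satisfies

$$\|f^{\mathcal{S}} - \hat{f}^{\mathcal{S}}\|_1 \le 74\epsilon.$$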


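Proof. We first show that $A^{\mathcal{S}'}$ has columns corresponding to all the components $r$ such that
$w_r \ge \epsilon/k$. For a mixture $f$ let $f_i$ be the projection of $f$ on coordinate $i$. Note that $\hat{f}_i(I_{i,t}) = \epsilon_1$ $\forall i, t$.
Therefore by the Dvoretzky-Kiefer-Wolfowitz theorem (see [Massart, 1990]) and the union bound, if
$n \ge \frac{2}{\epsilon_1^2} \log \frac{2d}{\delta}$, with probability $\ge 1 - \delta$,

$$f_i(I_{i,t}) \le \epsilon_1 + \epsilon_1 \le 2\epsilon_1 \quad \forall i, t.$$

Since $f_i = \sum_{r=1}^{k} w_r p_{r,i}$, with probability $\ge 1 - \delta$,

$$p_{r,i}(I_{i,t}) \le \frac{2\epsilon_1}{w_r} \quad \forall i, r.$$

If $w_r \ge \epsilon/k$, then $p_{r,i}(I_{i,t}) \le 2\epsilon^2/d$ and thus $A^{\mathcal{S}'}$ contains all the underlying components $r$ such
that $w_r \ge \epsilon/k$. Let $w^*$ be the weights corresponding to components such that $w_r \ge \epsilon/k$. Therefore
$\|w^*\|_1 \ge 1 - \epsilon$. Furthermore by Lemma 3.5, $b(S) \ge f^{\mathcal{S}'}(S)(1 - 3\epsilon) \ge (1 - 3\epsilon)(A^{\mathcal{S}'} w^*)(S)$. Therefore,
we have $\|b\|_1 = \left\| A^{\mathcal{S}'} w^* / \|w^*\|_1 \right\|_1$ and

$$b(S) \ge (1 - 3\epsilon)(A^{\mathcal{S}'} w^*)(S)\, \|w^*\|_1 / \|w^*\|_1 \ge (1 - 4\epsilon)\left(A^{\mathcal{S}'} w^* / \|w^*\|_1\right)(S).$$

Hence, by Theorem 2.1, the algorithm returns a solution $A^{\mathcal{S}'} w'$ such that $\|A^{\mathcal{S}'} w' - b\|_1 \le 64\epsilon$. Thus
by Lemma 3.5, $\sum_{S \in \mathcal{S}'} |(A^{\mathcal{S}'} w')(S) - f^{\mathcal{S}'}(S)| \le 70\epsilon$. Let $\hat{f}$ be the estimate corresponding to solution
$w'$. Since $f^{\mathcal{S}'}$ and $\hat{f}^{\mathcal{S}'}$ are flat within sets $S$, we have $\|\hat{f}^{\mathcal{S}'} - f^{\mathcal{S}'}\|_1 \le 70\epsilon$.

Since $\mathcal{S}'$ and $\mathcal{S}$ differ only in the set $U$ and by Lemma 3.5, $f^{\mathcal{S}}(U) = f^{\mathcal{S}'}(U) \le \epsilon/(1 - 3\epsilon)$, we
have

$$\|\hat{f}^{\mathcal{S}} - f^{\mathcal{S}}\|_1 \le \|\hat{f}^{\mathcal{S}'} - f^{\mathcal{S}'}\|_1 + 2 f^{\mathcal{S}}(U) \le 74\epsilon.$$

Note that the total error probability is $\le 2\delta$.

□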
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We now prove that if f s is close to / , then f is close to /. We first prove that flattened 
Gaussians in one dimension are close to their corresponding underlying Gaussian. 

Lemma 3.7. Let $p$ be a one dimensional Gaussian distribution and $\mathcal{I} = (I_1, I_2, \ldots)$ be a partition of the real line such that $\forall I \in \mathcal{I}$, $I$ is a continuous interval and $p(I) \le \epsilon$. Then

\[
\|p - p^{\mathcal{I}}\|_1 \le 30\sqrt{\epsilon}.
\]

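Proof. If $p$ and $\mathcal{I}$ are simultaneously scaled or translated, then the value of $\|p - p^{\mathcal{I}}\|_1$ remains unchanged. Hence proving the lemma for $p = N(0, 1)$ is sufficient. We first divide $\mathcal{I}$ into $\mathcal{I}_1, \mathcal{I}_2, \mathcal{I}_3$ depending on the minimum and maximum values of $p(x)$ in the corresponding intervals:

\[
I \in
\begin{cases}
\mathcal{I}_1 & \text{if } \min_{x \in I} p(x) \ge \sqrt{\epsilon/(2\pi)}, \\
\mathcal{I}_2 & \text{if } \max_{x \in I} p(x) \le \sqrt{\epsilon/(2\pi)}, \\
\mathcal{I}_3 & \text{else.}
\end{cases}
\]

The $\ell_1$ distance between $p$ and $p^{\mathcal{I}}$ is

\[
\|p - p^{\mathcal{I}}\|_1 = \sum_{I \in \mathcal{I}} \int_{x \in I} |p(x) - p^{\mathcal{I}}(x)| \, dx.
\]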

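We bound the above summation by breaking it into terms corresponding to $\mathcal{I}_1$, $\mathcal{I}_2$, and $\mathcal{I}_3$ respectively. Observe that $|\mathcal{I}_3| \le 2$ and $p(I) \le \epsilon \ \forall I \in \mathcal{I}_3$. Hence,

\[
\sum_{I \in \mathcal{I}_3} \int_{x \in I} |p(x) - p^{\mathcal{I}}(x)| \, dx \le 2\epsilon.
\]

Since $\max_{x \in I} p(x)$ for every interval $I \in \mathcal{I}_2$ is $\le \sqrt{\epsilon/(2\pi)}$, by Gaussian tail bounds

\[
\sum_{I \in \mathcal{I}_2} \int_{x \in I} |p(x) - p^{\mathcal{I}}(x)| \, dx
\le \sum_{I \in \mathcal{I}_2} \int_{x \in I} p(x) \, dx
\le \sqrt{\epsilon}.
\]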


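For every interval $I \in \mathcal{I}_1$ we first bound its interval length and the maximum value of $|p'(x)|$. Note that

\[
p(I) \ge |I| \min_{y \in I} p(y).
\]

In particular, since $p(I) \le \epsilon$ and $\min_{y \in I} p(y) \ge \sqrt{\epsilon/(2\pi)}$, we get $|I| \le \sqrt{2\pi\epsilon}$. Let $s = \max_{x \in I} |p'(x)|$. Then

\[
s = \max_{x \in I} |p'(x)| = \max_{x \in I} \frac{|x| e^{-x^2/2}}{\sqrt{2\pi}} \le \max_{x \in I} |x| \cdot \max_{x \in I} p(x).
\]

Since $\min_{y \in I} p(y) \ge \sqrt{\epsilon/(2\pi)}$, we have $\max_{y \in I} |y| \le \sqrt{\log(1/\epsilon)}$. Let $y_1 = \arg\max_{y \in I} p(y)$ and $y_2 = \arg\min_{y \in I} p(y)$, then

\[
\frac{\max_{y \in I} p(y)}{\min_{y \in I} p(y)} = \frac{p(y_1)}{p(y_2)}
= e^{(y_2^2 - y_1^2)/2}
= e^{(y_2 - y_1)(y_2 + y_1)/2}
\le e^{|I| \max_{y \in I} |y|}
\le e^{\sqrt{2\pi\epsilon \log(1/\epsilon)}}
\le 5.
\]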


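Since $p^{\mathcal{I}}(x) = p(I)/|I|$ on $I$, by Rolle's theorem $\exists x_0 \in I$ such that $p^{\mathcal{I}}(x) = p(x_0) \ \forall x \in I$. By first order Taylor's expansion,

\[
\begin{aligned}
\int_{x \in I} |p(x) - p^{\mathcal{I}}(x)| \, dx
&\le \int_{x \in I} |x - x_0| \max_{y \in [x_0, x]} |p'(y)| \, dx \\
&\le s \int_{x \in I} |x - x_0| \, dx \\
&\le s |I|^2 \\
&\le \sqrt{2\pi\epsilon} \, p(I) \max_{x \in I} |x| \cdot \frac{\max_{y \in I} p(y)}{\min_{y \in I} p(y)} \\
&\le 5\sqrt{2\pi\epsilon} \, p(I) \max_{x \in I} |x|,
\end{aligned}
\]

where the last three inequalities follow from the bounds on $|I|$, $s$, and $\max_{y \in I} p(y) / \min_{y \in I} p(y)$ respectively. Thus,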



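\[
\int_{x \in I} |p(x) - p^{\mathcal{I}}(x)| \, dx
\le 5\sqrt{2\pi\epsilon} \, p(I) \max_{x \in I} |x|
\le 5\sqrt{2\pi\epsilon} \int_{x \in I} p(x)\left(|x| + \sqrt{\epsilon}\right) dx.
\]

Summing over $I \in \mathcal{I}_1$, we get that the above summation is $\le 5\sqrt{2\pi\epsilon}(1 + \sqrt{\epsilon})$. Adding the terms corresponding to $\mathcal{I}_1$, $\mathcal{I}_2$, and $\mathcal{I}_3$, we get

\[
\|p - p^{\mathcal{I}}\|_1 \le 5\sqrt{2\pi\epsilon}(1 + \sqrt{\epsilon}) + \sqrt{\epsilon} + 2\epsilon \le 30\sqrt{\epsilon}.
\]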


□ 
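The bound of Lemma 3.7 can be probed numerically. The following is a minimal sketch, not part of the original argument: it assumes a partition of the real line into equal-mass intervals (each of mass at most $\epsilon$), treats the flattened density on the two unbounded end intervals as zero, and uses a hypothetical bisection helper `gaussian_quantile` together with a midpoint rule. All function names are illustrative.

```python
import math

def gaussian_pdf(x):
    """Density of the standard normal N(0, 1)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def gaussian_cdf(x):
    """CDF of N(0, 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gaussian_quantile(q, lo=-40.0, hi=40.0):
    """Invert the CDF by bisection (hypothetical helper, not from the paper)."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if gaussian_cdf(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def flattened_l1_distance(eps, steps=400):
    """Approximate ||p - p^I||_1 for p = N(0,1) and an equal-mass partition."""
    m = max(2, math.ceil(1.0 / eps))          # every interval has mass 1/m <= eps
    cuts = [gaussian_quantile(j / m) for j in range(1, m)]
    # Assumption: on the two unbounded end intervals the flattened density is 0,
    # so each of them contributes its full mass 1/m to the L1 distance.
    total = 2.0 / m
    for a, b in zip(cuts, cuts[1:]):
        flat = (gaussian_cdf(b) - gaussian_cdf(a)) / (b - a)   # p(I) / |I|
        h = (b - a) / steps
        # Midpoint rule for the integral of |p(x) - p(I)/|I|| over [a, b].
        total += h * sum(abs(gaussian_pdf(a + (j + 0.5) * h) - flat)
                         for j in range(steps))
    return total
```

For example, `flattened_l1_distance(0.01)` can be compared against `30 * math.sqrt(0.01)`; for this particular partition the computed distance is expected to fall well below the bound, which is consistent with the lemma being a worst-case statement over all admissible partitions.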


Using the above lemma we now show that for every $d$-dimensional $(\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_d)$, the $\epsilon$-flat Gaussian is close to the unflattened one.

Lemma 3.8. For every $(\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_d)$, $\epsilon$-good, axis-aligned Gaussian distribution $p = p_1 \times p_2 \times \ldots \times p_d$, we have

\[
\|p - p^{\mathcal{S}}\|_1 \le 30 d \sqrt{\epsilon}.
\]

Proof. By triangle inequality, the distance between any two product distributions is upper bounded by the sum of distances in each coordinate. Hence,

\[
\|p - p^{\mathcal{S}}\|_1 \le \sum_{i=1}^{d} \|p_i - p_i^{\mathcal{I}_i}\|_1 \le 30 d \sqrt{\epsilon},
\]

where the second inequality follows from Lemma 3.7.

□


We now have all the tools to prove the main result on Gaussian mixtures. 


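Theorem 3.9. Let $\epsilon_1 = \epsilon^3/(kd)$ and $n \ge \max\left\{ \frac{2}{\epsilon_1^2} \log \frac{2d}{\delta},\ \ldots \right\}$. Then given $2n$ samples from an axis-aligned Gaussian mixture $f$, with probability $\ge 1 - 2\delta$, Learn returns an estimate mixture $\hat{f}$ with at most $O(k/\epsilon^3)$ components such that

\[
\|\hat{f} - f\|_1 \le 170\epsilon.
\]

The run time of the algorithm is $O(1/\epsilon_1)^d$.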
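Proof. By triangle inequality,

\[
\|\hat{f} - f\|_1 \le \|f^{\mathcal{S}} - \hat{f}^{\mathcal{S}}\|_1 + \|\hat{f}^{\mathcal{S}} - \hat{f}\|_1 + \|f^{\mathcal{S}} - f\|_1.
\]

We now bound each of the terms above. By Lemma 3.6 the first term is $\le 74\epsilon$. By triangle inequality for $\hat{f} = \sum_{r} \hat{w}_r \hat{p}_r$,

\[
\|\hat{f}^{\mathcal{S}} - \hat{f}\|_1 \le \sum_{r} \hat{w}_r \|\hat{p}_r^{\mathcal{S}} - \hat{p}_r\|_1 \le 30\sqrt{2}\epsilon,
\]

where the last inequality follows from the fact that the allowed distributions in $A^{\mathcal{S}'}$ are $(\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_d)$, $2\epsilon^2/d$-good and by Lemma 3.8. By triangle inequality,

\[
\begin{aligned}
\|f^{\mathcal{S}} - f\|_1
&\le \sum_{r=1}^{k} w_r \|p_r^{\mathcal{S}} - p_r\|_1 \\
&\le \sum_{r :\, w_r \ge \epsilon/k} w_r \|p_r^{\mathcal{S}} - p_r\|_1 + \sum_{r :\, w_r < \epsilon/k} w_r \|p_r^{\mathcal{S}} - p_r\|_1 \\
&\le \sum_{r :\, w_r \ge \epsilon/k} w_r \|p_r^{\mathcal{S}} - p_r\|_1 + 2\epsilon \\
&\le 30\sqrt{2}\epsilon + 2\epsilon,
\end{aligned}
\]

where the last inequality follows from the proof of Lemma 3.6, where we showed that heavy components are $(\mathcal{I}_1, \mathcal{I}_2, \ldots, \mathcal{I}_d)$, $2\epsilon^2/d$-good, and by Lemma 3.8. Summing over the terms corresponding to $\|f^{\mathcal{S}} - \hat{f}^{\mathcal{S}}\|_1$, $\|\hat{f}^{\mathcal{S}} - \hat{f}\|_1$, and $\|f^{\mathcal{S}} - f\|_1$, we get the total error as $74\epsilon + 30\sqrt{2}\epsilon + 30\sqrt{2}\epsilon + 2\epsilon \le 170\epsilon$.

The error probability and the number of samples necessary are the same as those of Lemma 3.6. The run time follows from the comments in Section 3.2 and the bound on the number of samples. □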
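As a quick check of the constant in Theorem 3.9, the arithmetic behind the final bound can be spelled out; this only evaluates the numbers already stated in the proof.

```latex
\[
74\epsilon + 30\sqrt{2}\,\epsilon + 30\sqrt{2}\,\epsilon + 2\epsilon
= \left(76 + 60\sqrt{2}\right)\epsilon
\approx 160.85\,\epsilon
\le 170\,\epsilon .
\]
```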

If we consider the leading term in sample complexity, for $d = 1$ our bound is $\tilde{O}(k^2/\epsilon^6)$, and for $d > 1$, our bound is $\tilde{O}((kd)^d/\epsilon^{3d+2})$. While this is not the optimal sample complexity (see [Acharya et al., 2014]), we gain significantly in the running time.

4 Lower bounds


We now investigate lower bounds towards obtaining sparse approximate solutions to nonnegative systems. Our first result is that unless $P = NP$, we need to lose a factor of at least $\log(1/\epsilon)$ in the sparsity to be $\epsilon$-close in the $\ell_1$ norm. Formally,

Theorem 4.1. For any $\epsilon > 0$, given an instance of the sparse recovery problem $A, b$ that is promised to have a $k$-sparse nonnegative solution, it is NP-hard to obtain an $o(k \ln(1/\epsilon))$-sparse solution $x_{alg}$ with $\|A x_{alg} - b\|_1 \le \epsilon \|b\|_1$.

Our second result gives a connection to a random planted version of the set cover problem, which is beyond all known algorithmic techniques. We prove that unless this planted set cover problem can be solved efficiently, we cannot hope to obtain an $\epsilon$-approximate solution with sparsity $o(k/\epsilon^2)$.

Theorem 4.1 is inspired by the hard instances of the Max $k$-Cover problem [Feige, 1998, Feige et al., 2004, Feige and Vondrák, 2010].


Hard Instances of Max $k$-Cover. For any $c > 0$ and $\delta > 0$, given a collection of $n$ sets $S_1, S_2, \ldots, S_n \subseteq [m]$, it is NP-hard to distinguish between the following two cases:

• Yes case: There are $k$ disjoint sets in this collection whose union is $[m]$.

• No case: The union of any $\ell \le ck$ sets of this collection has size at most $\left(1 - (1 - \frac{1}{k})^{\ell} + \delta\right) m$.

Proof outline, Theorem 4.1. We reduce a hard instance of the Max $k$-Cover problem to our problem as follows. For each set $S_i$, we set $A_i$ (column $i$ in $A$) to be the indicator vector of set $S_i$. We also let $b$ be the vector with all entries equal to one.

In the Yes case, we know there are $k$ disjoint sets whose union is the universe, and we construct a solution $x^*$ as follows. We set $x^*_i$ (the $i$th entry of $x^*$) to one if set $S_i$ is one of these $k$ sets, and zero otherwise. It is clear that $Ax^*$ is equal to $b$, and therefore there exists a $k$-sparse solution in the Yes case.

On the other hand, for every $\epsilon$-approximate non-negative solution $x$, we know that the number of zero entries of $Ax$ is at most $\epsilon m$, since each zero entry contributes one to $\|Ax - b\|_1$. Define $\mathcal{C}$ to be the sub-collection of sets with non-zero entry in $x$, i.e. $\{S_i \mid x_i > 0\}$. We know that each non-zero entry in $Ax$ is covered by some set in the sub-collection $\mathcal{C}$. In other words, the union of sets in $\mathcal{C}$ has size at least $(1 - \epsilon)m$. This implies that the number of sets in the collection $\mathcal{C}$ should be at least $\Omega(k \ln(\frac{1}{\epsilon + \delta}))$, since $(1 - \frac{1}{k})^k$ is in the range $[\frac{1}{4}, \frac{1}{e}]$. We can choose $\delta$ to be $\epsilon$, and therefore the sparsest solution that one can find in the No case is $\Omega(k \ln(\frac{1}{\epsilon}))$-sparse. Assuming $P \ne NP$, it is not possible to find an $o(k \ln \frac{1}{\epsilon})$-sparse $\epsilon$-approximate solution when there exists a $k$-sparse solution; otherwise it becomes possible to distinguish between the Yes and No cases of the Max $k$-Cover problem in polynomial time. □
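The reduction in the proof outline can be sketched in a few lines. This is an illustrative sketch only: the helper names (`set_system_to_matrix`, `relative_l1_error`, `support_coverage`) and the toy instance are assumptions made for the example, not part of the paper. It builds the indicator matrix $A$ with $b$ the all-ones vector, and checks the two observations used in the argument: a disjoint cover gives an exact solution, and the support of any solution covers at least a $(1 - \text{error})$ fraction of the universe.

```python
def set_system_to_matrix(sets, m):
    """Row e, column i is 1 iff element e belongs to sets[i] (indicator columns)."""
    return [[1 if e in s else 0 for s in sets] for e in range(m)]

def relative_l1_error(A, x):
    """||Ax - b||_1 / ||b||_1 for b the all-ones vector (so ||b||_1 = m)."""
    m = len(A)
    residual = sum(abs(sum(a * xi for a, xi in zip(row, x)) - 1) for row in A)
    return residual / m

def support_coverage(sets, x):
    """Number of elements covered by the sets with a non-zero entry in x."""
    covered = set()
    for s, xi in zip(sets, x):
        if xi > 0:
            covered |= s
    return len(covered)

# Toy instance (hypothetical): universe {0,...,5}; the first two sets form a disjoint cover.
sets = [{0, 1, 2}, {3, 4, 5}, {0, 3}, {1, 4}, {2, 5}]
m = 6
A = set_system_to_matrix(sets, m)
x_star = [1, 1, 0, 0, 0]       # Yes case: indicator of the k = 2 disjoint covering sets
x_partial = [1, 0, 0, 0, 0]    # a sparser candidate that leaves elements uncovered
```

Here `relative_l1_error(A, x_star)` is zero, while for `x_partial` every uncovered element contributes one unit of error, which is the counting step that turns an approximate solution into a partial cover.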

Finally, we show that unless a certain variant of set cover can be solved efficiently, we cannot hope to obtain an $\epsilon$-approximate solution with sparsity $o(k/\epsilon^2)$. We will call this the planted set cover problem:

Definition 4.2. (Planted set cover $(k, m)$ problem) Given parameters $m$ and $k > m^{3/4}$, find an algorithm that distinguishes with probability $> 2/3$ between the following distributions over set systems over $m$ elements and $n = O(m/\log m)$ sets:

No case: The set system is random, with element $i$ in set $j$ with probability $1/k$ (independently).

Yes case: We take a random set system with $n - k$ sets as above, and add a random $k$-partition of the elements as the remaining $k$ sets. (Thus there is a perfect cover using $k$ sets.)
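The two distributions of Definition 4.2 can be sampled as follows. This is a sketch under stated assumptions: the function names are hypothetical, a "random $k$-partition" is interpreted as assigning each element independently to a uniformly random part (so parts may be empty), and the planted sets are simply appended at the end rather than shuffled into random positions.

```python
import random

def random_set_system(m, n, k, rng):
    """No case: n sets over elements 0..m-1, each element in each set w.p. 1/k."""
    return [{e for e in range(m) if rng.random() < 1.0 / k} for _ in range(n)]

def planted_set_system(m, n, k, rng):
    """Yes case: n - k random sets as above, plus a random k-partition of the elements."""
    sets = random_set_system(m, n - k, k, rng)
    parts = [set() for _ in range(k)]
    for e in range(m):
        # Assumption: each element joins a uniformly random part.
        parts[rng.randrange(k)].add(e)
    # The planted sets occupy the last k positions; a distinguisher should not
    # rely on this, so one would shuffle the list in an actual experiment.
    return sets + parts

rng = random.Random(0)
no_instance = random_set_system(m=60, n=20, k=5, rng=rng)
yes_instance = planted_set_system(m=60, n=20, k=5, rng=rng)
```

The small parameters above are chosen only to keep the example fast; they do not satisfy the regime $k > m^{3/4}$, $n = O(m/\log m)$ of the definition.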

To the best of our knowledge, none of the algorithmic techniques developed in the context of set cover can solve this distinguishing problem. The situation is similar in spirit to the planted clique and planted dense subgraph problems on random graphs, as well as random 3-SAT [Alon et al., 1998, Bhaskara et al., 2010, Feige, 2002]. This shows that obtaining sparse approximate solutions with sparsity $o(k/\epsilon^2)$ requires significantly new techniques. Formally, we show the following.

Theorem 4.3. Let $m^{3/4} \le k \le m/\log^2 m$. Any algorithm that finds an $o(k/\epsilon^2)$-sparse $\epsilon$-approximate solution to non-negative linear systems can solve the planted set cover $(k, m)$ problem.

Proof. Let $n$ (which is $O(m/\log m)$) be the number of sets in the set system. Let $A$ be the $m \times n$ matrix whose $(i, j)$th entry is 1 if element $i$ is in set $j$, and 0 otherwise. It is clear that in the Yes case, there exists a solution to $Ax = \mathbf{1}$ of sparsity $k$. It suffices to show that in the No case, there is no $\epsilon$-approximate solution to $Ax = \mathbf{1}$ with fewer than $\Omega(k/\epsilon^2)$ entries.

Let us define $C = 1/\epsilon^2$, for convenience. The proof follows the standard template in random matrix theory (e.g. [Rudelson and Vershynin, 2010]): we show that for any fixed $Ck$-sparse vector $x$, the probability that $\|Ax - \mathbf{1}\|_1 < 1/(4\sqrt{C})$ is tiny, and then take a union bound over all $x$ in a fine enough grid to conclude the claim for all $k$-sparse $x$.

Thus let us fix some $Ck$-sparse vector $x$ and consider the quantity $\|Ax - \mathbf{1}\|_1$. Let us then consider one row, which we denote by $y$, and consider $|\langle y, x \rangle - 1|$. Now each element of $y$ is 1 with probability $1/k$ and 0 otherwise (by the way the set system was constructed). Let us define the mean-zero random variable $W_i$, $1 \le i \le n$, as follows:

\[
W_i =
\begin{cases}
1 - 1/k & \text{with probability } 1/k, \\
-1/k & \text{otherwise.}
\end{cases}
\]

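We first note that $E[|\langle y, x \rangle - 1|^2] \ge E[(\sum_i W_i x_i)^2]$. This follows simply from the fact that for any random variable $Z$, we have $E[|Z - 1|^2] \ge E[|Z - E[Z]|^2]$ (i.e., the best way to "center" a distribution with respect to a least squares objective is at its mean). Thus let us consider

\[
E\Big[\Big(\sum_i W_i x_i\Big)^2\Big] = \sum_i x_i^2 \, E[W_i^2] = \frac{1}{k}\Big(1 - \frac{1}{k}\Big) \sum_i x_i^2.
\]

Since $x$ is $Ck$-sparse, and since $\|x\|_1 \ge 3k/4$, we have $\sum_i x_i^2 \ge \frac{1}{Ck} \cdot \|x\|_1^2 \ge k/(2C)$. Plugging this above and combining with our earlier observation, we obtain

\[
E[|\langle y, x \rangle - 1|^2] \ge E\Big[\Big(\sum_i W_i x_i\Big)^2\Big] \ge \frac{1}{3C}. \qquad (7)
\]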
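The moment computation behind (7) can be checked with exact rational arithmetic. This is a minimal sketch that assumes $W_i$ is the centered indicator $y_i - 1/k$ (taking the value $1 - 1/k$ with probability $1/k$ and $-1/k$ otherwise); the function names are hypothetical.

```python
from fractions import Fraction

def centered_bernoulli_moments(k):
    """Exact E[W], E[W^2], E[W^4] for W = y - 1/k with y ~ Bernoulli(1/k)."""
    p = Fraction(1, k)
    hi, lo = 1 - p, -p                       # the two values taken by W
    mean = p * hi + (1 - p) * lo
    second = p * hi ** 2 + (1 - p) * lo ** 2
    fourth = p * hi ** 4 + (1 - p) * lo ** 4
    return mean, second, fourth

def second_moment_of_row(x, k):
    """E[(sum_i W_i x_i)^2] for independent W_i: cross terms vanish since E[W_i] = 0."""
    _, second, _ = centered_bernoulli_moments(k)
    return second * sum(Fraction(xi) ** 2 for xi in x)

# Example: C = 4, k = 10, and a Ck-sparse x with all non-zero entries equal to 1/C,
# so that ||x||_1 = k lies in [3k/4, 5k/4].
C, k = 4, 10
x = [Fraction(1, C)] * (C * k)
row_second_moment = second_moment_of_row(x, k)
```

With these values the second moment of $W_i$ equals $\frac{1}{k}(1 - \frac{1}{k})$ and `row_second_moment` is at least $1/(3C)$, in line with (7); the fourth moment is at most $1/k$, which is the kind of bound used when controlling $E[Z^2]$.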


Now we will use the Paley-Zygmund inequality$^4$ with the random variable $Z := |\langle y, x \rangle - 1|^2$. For this we need to upper bound $E[Z^2] = E[|\langle y, x \rangle - 1|^4]$. We claim that we can bound it by a constant.

$^4$ For any non-negative random variable $Z$ and $\theta \in [0, 1]$, we have $\Pr(Z \ge \theta E[Z]) \ge (1 - \theta)^2 \cdot \frac{E[Z]^2}{E[Z^2]}$.

Now since $\|x\|_1$ is between $3k/4$ and $5k/4$, we have $|(\langle y, x \rangle - 1) - \sum_i W_i x_i| < 1/2$. This in turn implies that $E[Z^2] \le 4(E[(\sum_i W_i x_i)^4] + 4)$. We will show that $E[(\sum_i W_i x_i)^4] = O(1)$.

\[
\begin{aligned}
E\Big[\Big(\sum_i W_i x_i\Big)^4\Big]
&= \sum_i E[W_i^4] \, x_i^4 + 3 \sum_{i \ne j} E[W_i^2 W_j^2] \, x_i^2 x_j^2 \\
&\le \frac{1}{k} \sum_i x_i^4 + \frac{3}{k^2} \sum_{i, j} x_i^2 x_j^2 \\
&\le \frac{1}{k} \sum_i x_i + \frac{3}{k^2} \Big(\sum_i x_i\Big)^2 = O(1).
\end{aligned}
\]

Here we used the fact that we have $0 \le x_i \le 1$ for all $i$, and that $\sum_i x_i^4 \le \sum_i x_i^2 \le \sum_i x_i \le 5k/4$.
This implies, by using the Paley-Zygmund inequality, that

\[
\Pr\Big[ |\langle y, x \rangle - 1| < \frac{1}{4\sqrt{C}} \Big] < 1 - 1/10. \qquad (8)
\]


Thus if we now look at the $m$ rows of $A$, and consider the number of them that satisfy $|\langle y, x \rangle - 1| < 1/(4\sqrt{C})$, the expected number is $< 9m/10$; thus the probability that there are more than $19m/20$ such rows is $\exp(-\Omega(m))$. Thus we have that for any $Ck$-sparse $x$ with $\|x\|_1 \in [3k/4, 5k/4]$ and $\|x\|_\infty \le 1$,

\[
\Pr\Big[ \|Ax - \mathbf{1}\|_1 < \frac{1}{80\sqrt{C}} \Big] < e^{-m/40}. \qquad (9)
\]

Now let us construct an $\epsilon'$-net for the set of all $Ck$-sparse vectors, with $\epsilon' = 1/m^2$. A simple way to do it is to first pick the non-zero coordinates, and take all integer multiples of $\epsilon'/m$ as the coordinates. It is easy to see that this set of points (call it $\mathcal{N}$) is an $\epsilon'$-net, and furthermore, it has size roughly

\[
\binom{m}{Ck} \Big(\frac{m}{\epsilon'}\Big)^{Ck} = O\big(m^{4Ck}\big).
\]

Thus as long as $m > 200\,Ck \log m$, we can take a union bound over all the vectors in the $\epsilon'$-net, to conclude that with high probability we have

\[
\|Ax - \mathbf{1}\|_1 \ge \frac{1}{80\sqrt{C}} \quad \text{for all } x \in \mathcal{N}.
\]

In the event that this happens, we can use the fact that $\mathcal{N}$ is an $\epsilon'$-net (with $\epsilon' = 1/m^2$), to conclude that $\|Ax - \mathbf{1}\|_1 = \Omega(1/\sqrt{C})$ for all $Ck$-sparse vectors with coordinates in $[0, 1]$ and $\|x\|_1 \in [3k/4, 5k/4]$.

This completes the proof of the theorem, since $1/\sqrt{C} = \epsilon$, and $k < m/\log^2 m$. □

A Auxiliary lemmas


The simple technical lemma we required in the proof is the following. 

Lemma A.1. Let $\delta > 0$, and $f(t) := (1 + \delta)^t - 1$. Then for any $t_1, t_2 \ge 0$, we have

\[
f(t_1) + f(t_2) \le f(t_1 + t_2).
\]

Proof. The proof follows immediately upon expansion:

\[
f(t_1 + t_2) - f(t_1) - f(t_2) = \big((1 + \delta)^{t_1} - 1\big)\big((1 + \delta)^{t_2} - 1\big).
\]

The term above is non-negative because $\delta, t_1, t_2$ are all $\ge 0$. □
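For completeness, the expansion behind Lemma A.1 can be written out step by step. The shorthand $a := (1+\delta)^{t_1}$ and $b := (1+\delta)^{t_2}$ is introduced here only for this derivation.

```latex
\[
\begin{aligned}
f(t_1 + t_2) - f(t_1) - f(t_2)
&= (ab - 1) - (a - 1) - (b - 1) \\
&= ab - a - b + 1 \\
&= (a - 1)(b - 1) \;\ge\; 0,
\end{aligned}
\]
% a, b >= 1 because \delta > 0 and t_1, t_2 >= 0, so both factors are non-negative.
```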


Lemma A.2 (Averaging). Let $\{a_i, b_i\}_{i=1}^{k}$ be non-negative real numbers, such that

\[
\sum_i a_i = A \quad \text{and} \quad \sum_i b_i = 1.
\]

Then for any parameter $C > 1$, there exists an index $i$ such that $b_i \ge 1/(Ck)$, and $a_i \le b_i \cdot A/(1 - 1/C)$.

Proof. Let $S := \{i : b_i \ge 1/(Ck)\}$. Now since there are only $k$ indices, we have $\sum_{i \notin S} b_i < k \cdot 1/(Ck) = 1/C$, and thus

\[
\sum_{i \in S} b_i > (1 - 1/C). \qquad (10)
\]

Next, since all the $a_i$ are non-negative, we get that

\[
\sum_{i \in S} a_i \le A.
\]

Combining the two, we have

\[
\frac{\sum_{i \in S} a_i}{\sum_{i \in S} b_i} \le \frac{A}{1 - 1/C}.
\]

Thus there exists an index $i \in S$ such that $a_i \le b_i \cdot A/(1 - 1/C)$ (because otherwise, we have $a_i > b_i A/(1 - 1/C)$ for all $i \in S$; thus summing over $i \in S$, we get a contradiction to the above). This proves the lemma. □
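The averaging lemma is constructive: a linear scan finds a suitable index. The sketch below is illustrative (the function name `averaging_index` is hypothetical) and assumes the inputs satisfy the hypotheses of Lemma A.2, namely non-negative entries with the $b_i$ summing to one.

```python
def averaging_index(a, b, C):
    """Return an index i with b[i] >= 1/(C*k) and a[i] <= b[i] * A / (1 - 1/C).

    Assumes a and b are non-negative sequences of equal length k, sum(b) == 1,
    and C > 1. Returns None only if no such index exists (which, by the lemma,
    should not happen for valid inputs, up to floating-point rounding).
    """
    k = len(a)
    A = sum(a)
    bound = A / (1 - 1 / C)
    for i in range(k):
        if b[i] >= 1 / (C * k) and a[i] <= b[i] * bound:
            return i
    return None

# Example: index 0 is skipped because b[0] = 0.1 < 1/(C*k) = 0.125.
example_index = averaging_index([3, 1, 0.5, 0.5], [0.1, 0.2, 0.3, 0.4], C=2)
```

Passing `fractions.Fraction` values for `a`, `b`, and `C` keeps all comparisons exact, which can be useful when the inequalities are tight.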


Lemma A.3. For any $0 < x < 1$ and $\delta > 0$, we have

\[
(1 + \delta)^x \le 1 + \delta x \le (1 + \delta)^{x(1 + \delta)}.
\]

Proof. For any $0 < \theta < \delta$, we have

\[
\frac{1}{1 + \theta} \le \frac{1}{1 + \theta x} \le \frac{1 + \delta}{1 + \theta}.
\]

The first inequality is because $x < 1$, and the second is because the RHS is bigger than 1 while the LHS is smaller. Now integrating from $\theta = 0$ to $\theta = \delta$, we get

\[
\log(1 + \delta) \le \frac{\log(1 + x\delta)}{x} \le (1 + \delta) \log(1 + \delta).
\]

Multiplying out by $x$ and exponentiating gives the desired claim. □
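The two-sided bound of Lemma A.3 is easy to evaluate numerically. The helper below is a small illustrative sketch (the name `sandwich` is hypothetical); it returns the three quantities in the order in which the lemma compares them.

```python
def sandwich(x, delta):
    """Return ((1+delta)^x, 1 + delta*x, (1+delta)^(x*(1+delta))) for 0 < x < 1, delta > 0."""
    lower = (1 + delta) ** x
    middle = 1 + delta * x
    upper = (1 + delta) ** (x * (1 + delta))
    return lower, middle, upper

# Example: x = 0.5, delta = 1 gives sqrt(2), 1.5, and 2 respectively.
lower, middle, upper = sandwich(0.5, 1.0)
```

The left inequality is Bernoulli's inequality for exponents in $(0, 1)$; the right one shows that the linear expression is matched by inflating the exponent by a factor of $1 + \delta$.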